1 📖 Introduction

This comprehensive tutorial demonstrates the implementation of RNA velocity analysis using scVelo(Bergen et al. 2020), a state-of-the-art Python package for single-cell RNA sequencing (scRNA-seq) data analysis. Published in Nature Biotechnology (2020), scVelo represents a significant advancement in transcriptional dynamics analysis, building upon and enhancing the foundational RNA velocity framework while addressing limitations of its predecessor, velocyto.

RNA velocity(La Manno et al. 2018) analysis employs sophisticated mathematical modeling of transcriptional kinetics to unveil cellular dynamics that traditional scRNA-seq approaches cannot directly capture. This powerful methodology enables researchers to: 1. Determine the transcriptional state of genes (induction or repression) within specific cell populations 2. Predict cellular trajectories and fate decisions through pseudotime analysis 3. Reconstruct developmental hierarchies with unprecedented temporal resolution

By leveraging the analytical capabilities of scVelo, we can gain deep insights into cellular differentiation processes, developmental trajectories, and dynamic gene regulation in complex biological systems.

This tutorial integrates best practices from the official scVelo documentation while incorporating advanced analytical approaches for comprehensive single-cell analysis.

2 📦 Load libraries

library(CellChat)
library(ggalluvial)
library(Matrix)
library(NMF)
library(patchwork)
library(pdftools)
library(reticulate)
library(Seurat)
library(svglite)
library(tidyverse)

set.seed(12345)

options(stringsAsFactors = FALSE)

reticulate::use_python("C:/Users/jwang/AppData/Local/Programs/Python/Python313/python.exe", required=T)

setwd("D:/Data_Analysis_Practice_2025/scRNAseq/scRNAseq_Analysis_templates")

2.1 🔄 Data Format Conversion: Seurat to AnnData Integration

Our analysis begins with a curated mouse brain dataset encompassing both disease and control samples. The initial data processing pipeline, executed in R using the Seurat framework(Hao et al. 2021), included: - Quality control and cell filtering - Data normalization and scaling - Dimensionality reduction - Batch effect correction - Unsupervised clustering analysis

A critical step in our workflow involves converting the processed Seurat object to AnnData format, the primary data structure utilized by scVelo. While the SeuratDisk package offers direct conversion capabilities, we implement a more robust custom conversion pipeline to ensure data integrity and compatibility. This approach provides greater control over the transfer of key analytical components between the two frameworks.

2.2 assuming that you have some Seurat object called seurat_obj:

# save metadata table:
seurat_obj$barcode <- colnames(seurat_obj)
seurat_obj$UMAP_1 <- seurat_obj@reductions$umap@cell.embeddings[,1]
seurat_obj$UMAP_2 <- seurat_obj@reductions$umap@cell.embeddings[,2]
write.csv(seurat_obj@meta.data, file='metadata.csv', quote=F, row.names=F)

# write expression counts matrix

counts_matrix <- GetAssayData(seurat_obj, assay='RNA', slot='counts')
writeMM(counts_matrix, file=paste0(out_data_dir, 'counts.mtx'))

# write dimesnionality reduction matrix, in this example case pca matrix
write.csv(seurat_obj@reductions$pca@cell.embeddings, file='pca.csv', quote=F, row.names=F)

# write gene names
write.table(
  data.frame('gene'=rownames(counts_matrix)),file='gene_names.csv',
  quote=F,row.names=F,col.names=F
)

2.3 write expression counts matrix:

import scanpy as sc
import anndata
from scipy import io
from scipy.sparse import coo_matrix, csr_matrix
import numpy as np
import os
import pandas as pd

# load sparse matrix:
X = io.mmread("count/counts.mtx")

# create anndata object
adata = anndata.AnnData(
    X=X.transpose().tocsr()
)

# load cell metadata:
cell_meta = pd.read_csv("count/metadata.csv")

# load gene names:
with open("gene_names.csv", 'r') as f:
    gene_names = f.read().splitlines()

# set anndata observations and index obs by barcodes, var by gene names
adata.obs = cell_meta
adata.obs.index = adata.obs['count/barcode']
adata.var.index = gene_names

# load dimensional reduction:
pca = pd.read_csv("count/pca.csv")
pca.index = adata.obs.index

# set pca and umap
adata.obsm['X_pca'] = pca.to_numpy()
adata.obsm['X_umap'] = np.vstack((adata.obs['UMAP_1'].to_numpy(), adata.obs['UMAP_2'].to_numpy())).T

# plot a UMAP colored by sampleID to test:
sc.pl.umap(adata, color=['SampleID'], frameon=False, save=True)

# save dataset as anndata format
adata.write('my_data.h5ad')

# reload dataset
adata = sc.read_h5ad('my_data.h5ad')

3 🧬 Generation of Spliced and Unspliced Count Matrices

RNA velocity analysis requires the distinction between spliced and unspliced transcripts, necessitating a specialized quantification approach beyond traditional UMI-based count matrices. This section details the construction of these distinct matrices using the velocyto command-line interface (CLI).

While both velocyto and Kallisto-Bustools are capable of generating these specialized matrices, we implement velocyto due to its: 1. Robust integration with Cell Ranger outputs 2. Flexible compatibility with diverse single-cell platforms 3. Established track record in RNA velocity analysis

The velocyto CLI offers seamless processing of: - Direct Cell Ranger output directories - Generic BAM files from any single-cell platform - Species-specific annotations (utilizing mm10 reference in this implementation) - Optional repeat masking for enhanced accuracy (recommended for optimal results)

This approach ensures accurate quantification of splicing states, crucial for downstream velocity calculations.

#!/bin/bash

#SBATCH --job-name=velocyto_run
#SBATCH --nodes=1
#SBATCH --ntasks=10
#SBATCH --cpus-per-task=4
#SBATCH --mem=240G
#SBATCH --time=24:00:00
#SBATCH --output=./SLURM_OUT/%x_%A_%a.out 
#SBATCH --error=./SLURM_OUT/%x_%A_%a.err 

transcriptome="refdata-gex-GRCh38-2024-A/genes/genes.gtf"
BARCODES="cellranger_results/11C1_BPN1/outs/filtered_feature_bc_matrix/barcodes.tsv"
BAMFILE="cellranger_results/11C1_BPN1/outs/possorted_genome_bam.bam"

velocyto run -b $BARCODES -o ./velocyto $BAMFILE $transcriptome

4 📥 Load data into Python

Now that we have our input data properly formatted, we can load it into python. Velocyto created a separate spliced and unspliced matrix for each sample, so we first have to merge the different samples into one object. Additionally, I am reformatting the cell barcodes to match my anndata object with the full genes-by-cells data.

import scvelo as scv
import scanpy as sc
import cellrank as cr
import numpy as np
import pandas as pd
import anndata as ad
scv.settings.verbosity = 3
scv.settings.set_figure_params('scvelo', facecolor='white', dpi=100, frameon=False)
cr.settings.verbosity = 2

adata = sc.read_h5ad('count/my_data.h5ad')

4.1 load loom files

ldata_11C1_BPN1 = sc.read('count/possorted_genome_bam_11C1_BPN1.loom', cache=True)
ldata_11C1_BPN2 = sc.read('count/possorted_genome_bam_11C1_BPN2.loom', cache=True)
ldata_11C1_VEH1 = sc.read('count/possorted_genome_bam_11C1_VEH1.loom', cache=True)
ldata_11C1_VEH2 = sc.read('count/possorted_genome_bam_11C1_VEH2.loom', cache=True)
ldata_FXS_BPN1 = sc.read('count/possorted_genome_bam_FXS_BPN1.loom', cache=True)
ldata_FXS_BPN2 = sc.read('count/possorted_genome_bam_FXS_BPN2.loom', cache=True)
ldata_FXS_VEH1 = sc.read('count/possorted_genome_bam_FXS_VEH1.loom', cache=True)
ldata_FXS_VEH2 = sc.read('count/possorted_genome_bam_FXS_VEH2.loom', cache=True)

# rename barcodes in order to  merge
barcodes = [bc.split(':')[1] for bc in ldata_11C1_BPN1.obs.index.tolist()]
barcodes = [bc[0:len(bc)-1] + '_1' for bc in barcodes]
ldata_11C1_BPN1.obs.index = barcodes
print(ldata_11C1_BPN1.layers.keys())  # Should show 'spliced' and 'unspliced'


barcodes = [bc.split(':')[1] for bc in ldata_11C1_BPN2.obs.index.tolist()]
barcodes = [bc[0:len(bc)-1] + '_3' for bc in barcodes]
ldata_11C1_BPN2.obs.index = barcodes

barcodes = [bc.split(':')[1] for bc in ldata_11C1_VEH1.obs.index.tolist()]
barcodes = [bc[0:len(bc)-1] + '_4' for bc in barcodes]
ldata_11C1_VEH1.obs.index = barcodes

barcodes = [bc.split(':')[1] for bc in ldata_11C1_VEH2.obs.index.tolist()]
barcodes = [bc[0:len(bc)-1] + '_7' for bc in barcodes]
ldata_11C1_VEH2.obs.index = barcodes

barcodes = [bc.split(':')[1] for bc in ldata_FXS_BPN1.obs.index.tolist()]
barcodes = [bc[0:len(bc)-1] + '_2' for bc in barcodes]
ldata_FXS_BPN1.obs.index = barcodes

barcodes = [bc.split(':')[1] for bc in ldata_FXS_BPN2.obs.index.tolist()]
barcodes = [bc[0:len(bc)-1] + '_5' for bc in barcodes]
ldata_FXS_BPN2.obs.index = barcodes

barcodes = [bc.split(':')[1] for bc in ldata_FXS_VEH1.obs.index.tolist()]
barcodes = [bc[0:len(bc)-1] + '_6' for bc in barcodes]
ldata_FXS_VEH1.obs.index = barcodes

barcodes = [bc.split(':')[1] for bc in ldata_FXS_VEH2.obs.index.tolist()]
barcodes = [bc[0:len(bc)-1] + '_8' for bc in barcodes]
ldata_FXS_VEH2.obs.index = barcodes

# make variable names unique
ldata_11C1_BPN1.var_names_make_unique()
ldata_11C1_BPN2.var_names_make_unique()
ldata_11C1_VEH1.var_names_make_unique()
ldata_11C1_VEH2.var_names_make_unique()
ldata_FXS_BPN1.var_names_make_unique()
ldata_FXS_BPN2.var_names_make_unique()
ldata_FXS_VEH1.var_names_make_unique()
ldata_FXS_VEH2.var_names_make_unique()

# concatenate the eight loom files
ldata = ldata_11C1_BPN1.concatenate([ldata_11C1_BPN2, ldata_11C1_VEH1, ldata_11C1_VEH2, ldata_FXS_BPN1, ldata_FXS_BPN2, ldata_FXS_VEH1, ldata_FXS_VEH2])
print(adata.layers.keys())  # Should show 'spliced' and 'unspliced'

# merge matrices into the original adata object
# Ensure we keep spliced/unspliced counts
adata = scv.utils.merge(adata, ldata, copy=True)
adata.write('count/merged_data.h5ad')
adata = sc.read_h5ad('count/merged_data.h5ad')
print(adata)

4.2 plot umap to check

sc.pl.umap(adata, color='annotated_celltype', frameon=False, legend_loc='on data', title='', save='_annotated_celltypes.pdf')
umap_annotated_celltypes

Figure 1. umap_annotated_celltypes

5 🔄 RNA Velocity Computation and Model Implementation

With our data properly structured, we proceed to the core analytical phase: RNA velocity computation using scVelo’s sophisticated modeling framework. The package offers two distinct mathematical approaches:

  1. Steady-State Model (2018):
    • Based on the original RNA velocity framework
    • Assumes constant transcriptional regulation
    • Suitable for stable cellular states
  2. Dynamical Model (2020):
    • Incorporates temporal gene regulation changes
    • Accounts for transient cellular states
    • Provides enhanced resolution of developmental dynamics

Our analysis pipeline begins with a thorough inspection of transcriptional states across cell clusters, followed by rigorous preprocessing steps. Initially, we implement the steady-state model with stochastic optimization to establish baseline velocity estimates.

scv.pl.proportions(adata, groupby='annotated_celltype',save='figures/proportions.pdf')
proportions

Figure 2. proportions

5.1 pre-process

scv.pp.filter_and_normalize(adata)
scv.pp.moments(adata)

5.2 compute velocity

scv.tl.velocity(adata, mode='stochastic')
scv.tl.velocity_graph(adata)
adata.write('count/after_computing_velocity_merged_data.h5ad')
adata = sc.read_h5ad('count/after_computing_velocity_merged_data.h5ad')

6 👀 Visualization of RNA Velocity Vector Fields

The complex dynamics of cellular transcriptional states can be elegantly visualized through RNA velocity vector fields. By projecting velocity vectors onto two-dimensional embeddings (typically UMAP or t-SNE), we can observe: - Directional trends in gene expression changes - Potential developmental trajectories - Transition states between cell populations - Local and global patterns of transcriptional dynamics

These visualizations integrate velocity information across all genes and cells, providing a comprehensive view of cellular dynamics in the reduced dimensional space.

scv.pl.velocity_embedding(adata, basis='umap', frameon=False, save='embedding.pdf')
scvelo_embedding

Figure 3. scvelo_embedding

scv.pl.velocity_embedding_grid(adata, basis='umap', color='annotated_celltype', save='embedding_grid.pdf', title='', scale=0.25)
scvelo_embedding_grid

Figure 4. scvelo_embedding_grid

scv.pl.velocity_embedding_stream(adata, basis='umap', color=['annotated_celltype', 'orig.ident'], save='embedding_stream.pdf', title='')
scvelo_embedding_stream

Figure 5. scvelo_embedding_stream

6.1 🖼️ plot velocity of a selected gene

scv.pl.velocity(adata, var_names=['SLC17A6'], color='annotated_celltype', save='SLC17A6_stream.pdf')
scvelo_SLC17A6_stream

Figure 6. scvelo_SLC17A6_stream

7 ⏬ Advanced Downstream Analysis

The RNA velocity framework enables sophisticated downstream analyses to extract biological insights from transcriptional dynamics. Our analytical pipeline encompasses:

  1. Dynamic Gene Identification:
    • Detection of genes with significant velocity patterns
    • Ranking of genes by their dynamic behavior
    • Assessment of transcriptional regulation strength
  2. Velocity Coherence Analysis:
    • Quantification of coordinated transcriptional changes
    • Evaluation of local cell-state transitions
    • Identification of developmental decision points
  3. Trajectory Analysis:
    • Pseudotime inference from velocity patterns
    • Reconstruction of developmental hierarchies
    • Prediction of cell-fate determinants
  4. Graph-based Topology:
    • Implementation of partition-based graph abstraction (PAGA)(Wolf et al. 2019)
    • Orientation of cellular hierarchies
    • Mapping of lineage relationships

These analyses collectively provide a comprehensive understanding of cellular dynamics and developmental trajectories.

scv.tl.rank_velocity_genes(adata, groupby='annotated_celltype', min_corr=.3)

# scv.DataFrame was removed in the past. Please use pandas.DataFrame, instead.
df = pd.DataFrame(adata.uns['rank_velocity_genes']['names'])
df.head()
df.to_csv('count/rank_velocity_genes.csv', index=False) 
df = pd.read_csv('count/rank_velocity_genes.csv')

kwargs = dict(frameon=False, size=10, linewidth=1.5,
              add_outline='excitatory neurons, radial glia')

scv.pl.scatter(adata, df['excitatory neurons'][:3], ylabel='excitatory neurons', frameon=False, color='annotated_celltype', size=10, linewidth=1.5, save="excitatory_neurons_scatter.pdf")
excitatory_neurons_scatter

Figure 7. excitatory_neurons_scatter

scv.pl.scatter(adata, df['radial glia'][:3], ylabel='radial glia', frameon=False, color='annotated_celltype', size=10, linewidth=1.5, save="radial_glia_scatter.pdf")
scvelo_radial_glia_scatter

Figure 8. scvelo_radial_glia_scatter

scv.tl.velocity_confidence(adata)
keys = 'velocity_length', 'velocity_confidence' 
scv.pl.scatter(adata, c=keys, cmap='coolwarm', perc=[5, 95], save="velocity_confidence.pdf")
scvelo_velocity_confidence

Figure 9. scvelo_velocity_confidence

scv.pl.velocity_graph(adata, threshold=.1, color='annotated_celltype', save="velocity_graph.pdf")
scvelo_velocity_graph

Figure 10. scvelo_velocity_graph

x, y = scv.utils.get_cell_transitions(adata, basis='umap', starting_cell=70)
ax = scv.pl.velocity_graph(adata, c='lightgrey', edge_width=.05, show=False, save="velocity_graph_lightgrey.pdf")
ax = scv.pl.scatter(adata, x=x, y=y, s=120, c='ascending', cmap='gnuplot', ax=ax, save="scv_pl_scatter.pdf")
scv.tl.velocity_pseudotime(adata)
scv.pl.scatter(adata, color='velocity_pseudotime', cmap='gnuplot', save="velocity_pseudotime.pdf")
scvelo_velocity_pseudotime

Figure 11. scvelo_velocity_pseudotime

adata.uns['neighbors']['distances'] = adata.obsp['distances']
adata.uns['neighbors']['connectivities'] = adata.obsp['connectivities']
scv.tl.paga(adata, groups='annotated_celltype')
df = scv.get_df(adata, 'paga/transitions_confidence', precision=2).T
#df.style.background_gradient(cmap='Blues').format('{:.2g}')

scv.pl.paga(adata, basis='umap', size=50, alpha=.1,
            min_edge_width=2, node_size_scale=1.5)

8 🛳️ Lineage-Specific Velocity Analysis

When analyzing complex tissues with multiple distinct cellular lineages, a targeted approach focusing on specific cell populations can reveal nuanced biological insights that might be obscured in global analysis. In this section, we demonstrate:

  1. Population-Specific Analysis:
    • Isolation of neuronal lineage clusters
    • Focused examination of developmental trajectories
    • Enhanced resolution of lineage-specific dynamics
  2. Advanced Modeling Implementation:
    • Transition from steady-state to dynamical modeling
    • Integration of temporal gene regulation
    • Improved capture of transient states

This refined approach enables deeper insights into lineage-specific developmental programs and cellular state transitions.

cur_celltypes = ['early differentiating neurons', 'excitatory neurons', 'immature neurons', 'inhibitory neurons', 'neural progenitor cells', 'neurogenic radial glia', 'proliferating radial glia', 'radial glia']
adata_subset = adata[adata.obs['annotated_celltype'].isin(cur_celltypes)]

sc.pl.umap(adata_subset, color=['annotated_celltype', 'orig.ident'], frameon=False, title=['', ''])

sc.pp.neighbors(adata_subset, n_neighbors=15, use_rep='X_pca')
scv.pp.filter_and_normalize(adata_subset)
scv.pp.moments(adata_subset)


scv.tl.recover_dynamics(adata_subset)

scv.tl.velocity(adata_subset, mode='dynamical')
scv.tl.velocity_graph(adata_subset)

scv.pl.velocity_embedding_stream(adata_subset, basis='umap', color=['annotated_celltype', 'orig.ident'], save='embedding_stream.pdf', title='')
df = adata_subset.var
df = df[(df['fit_likelihood'] > .1) & df['velocity_genes'] == True]

kwargs = dict(xscale='log', fontsize=16)
with scv.GridSpec(ncols=3) as pl:
    pl.hist(df['fit_alpha'], xlabel='transcription rate', **kwargs)
    pl.hist(df['fit_beta'] * df['fit_scaling'], xlabel='splicing rate', xticks=[.1, .4, 1], **kwargs)
    pl.hist(df['fit_gamma'], xlabel='degradation rate', xticks=[.1, .4, 1], **kwargs)

scv.get_df(adata_subset, 'fit*', dropna=True).head()

scv.tl.latent_time(adata_subset)
scv.pl.scatter(adata_subset, color='latent_time', color_map='gnuplot', size=80)
top_genes = adata_subset.var['fit_likelihood'].sort_values(ascending=False).index[:300]
scv.pl.heatmap(adata_subset, var_names=top_genes, sortby='latent_time', col_color='annotated_celltype', n_convolve=100)

top_genes = adata_subset.var['fit_likelihood'].sort_values(ascending=False).index
scv.pl.scatter(adata_subset, color='annotated_celltype', basis=top_genes[:15], ncols=5, frameon=False)

var_names = ['MTUS2', 'PPFIA2', 'NMNAT2']
scv.pl.scatter(adata_subset, var_names, color='annotated_celltype', frameon=False)
scv.pl.scatter(adata_subset, x='latent_time', y=var_names, color='annotated_celltype', frameon=False)

9 🪗 Conclusions and Future Perspectives

This comprehensive tutorial has demonstrated the power and versatility of RNA velocity analysis in single-cell transcriptomics using scVelo. We have explored:

  1. Core Analytical Capabilities:
    • Implementation of steady-state and dynamical velocity models
    • Identification of cell-state transitions
    • Detection of dynamically regulated genes
    • Reconstruction of developmental trajectories
  2. Advanced Applications:
    • Lineage-specific analysis
    • Pseudotime inference
    • Vector field visualization
    • Graph-based trajectory analysis
  3. Future Directions:
    • Integration with CellRank(Lange et al. 2022) for enhanced fate mapping
    • Multi-modal data integration possibilities
    • Advanced trajectory inference methods
    • Novel biological insight generation

As single-cell technologies continue to evolve, RNA velocity analysis stands as a fundamental tool for understanding cellular dynamics and developmental processes. The integration of these approaches with emerging computational methods promises to further enhance our understanding of complex biological systems.

References

Bergen, V., M. Lange, S. Peidli, F. A. Wolf, and F. J. Theis. 2020. “Generalizing RNA Velocity to Transient Cell States Through Dynamical Modeling.” Journal Article. Nat Biotechnol 38 (12): 1408–14. https://doi.org/10.1038/s41587-020-0591-3.
Hao, Y., S. Hao, E. Andersen-Nissen, 3rd Mauck W. M., S. Zheng, A. Butler, M. J. Lee, et al. 2021. “Integrated Analysis of Multimodal Single-Cell Data.” Journal Article. Cell 184 (13): 3573–3587 e29. https://doi.org/10.1016/j.cell.2021.04.048.
La Manno, G., R. Soldatov, A. Zeisel, E. Braun, H. Hochgerner, V. Petukhov, K. Lidschreiber, et al. 2018. “RNA Velocity of Single Cells.” Journal Article. Nature 560 (7719): 494–98. https://doi.org/10.1038/s41586-018-0414-6.
Lange, M., V. Bergen, M. Klein, M. Setty, B. Reuter, M. Bakhti, H. Lickert, et al. 2022. “CellRank for Directed Single-Cell Fate Mapping.” Journal Article. Nat Methods 19 (2): 159–70. https://doi.org/10.1038/s41592-021-01346-6.
Wolf, F. A., F. K. Hamey, M. Plass, J. Solana, J. S. Dahlin, B. Gottgens, N. Rajewsky, L. Simon, and F. J. Theis. 2019. “PAGA: Graph Abstraction Reconciles Clustering with Trajectory Inference Through a Topology Preserving Map of Single Cells.” Journal Article. Genome Biol 20 (1): 59. https://doi.org/10.1186/s13059-019-1663-x.